Skip to content

Propagate trace context for entity operations#654

Open
cgillum wants to merge 10 commits intomainfrom
fix/entity-trace-context
Open

Propagate trace context for entity operations#654
cgillum wants to merge 10 commits intomainfrom
fix/entity-trace-context

Conversation

@cgillum
Copy link
Member

@cgillum cgillum commented Mar 1, 2026

Summary

Entity operations triggered from orchestrations don't propagate distributed trace context because SendEntityMessageAction lacks a parentTraceContext field, and entity operation events (EntityOperationSignaledEvent, EntityOperationCalledEvent) also lack this field. This causes entity operations to appear as disconnected root spans rather than being properly nested under their parent orchestration in distributed tracing tools.

Changes

Proto updates (see microsoft/durabletask-protobuf#64)

  • Add parentTraceContext field to SendEntityMessageAction (field 5)
  • Add parentTraceContext field to EntityOperationSignaledEvent (field 6) and EntityOperationCalledEvent (field 8)

Trace context propagation (ProtoUtils.cs)

  • Sending side: Set ParentTraceContext on SendEntityMessageAction in ConstructOrchestratorResponse, matching the existing pattern for tasks and sub-orchestrations. Uses CreateTraceContext() which generates a random clientSpanId for Client→Server span pairing.
  • Receiving side: Extract parentTraceContext from EntityOperationSignaledEvent / EntityOperationCalledEvent in ToEntityBatchRequest, populating OperationRequest.TraceContext.

Entity Server spans (TraceHelper.cs, Processor.cs)

  • Add StartTraceActivityForEntityOperation() — creates a Server span during entity execution, parented under the Client span via the propagated trace context.
  • Wire up in OnRunEntityBatchAsync using the first operation's trace context.

Entity Client spans (TraceHelper.cs, Processor.cs)

  • Add EmitTraceActivityForEntityOperationCompleted() and EmitTraceActivityForEntityOperationFailed() — emit Client spans during orchestration replay when entity operation results arrive.
  • Add StartTraceActivityForSchedulingEntityOperation() — creates the Client span and uses SetSpanId() to match the clientSpanId from ParentTraceContext, creating the Client→Server nesting.
  • Handle EntityOperationCompleted and EntityOperationFailed events in the orchestrator newEvents processing loop.
  • Add GetEntityOperationCalledEvent() helper for correlating completion events with their original called events via requestId.

Schema and constants

  • Add EntityOperation = "entity_operation" to TraceActivityConstants.cs
  • Add EntityOperationName = "durabletask.entity.operation" to Schema.cs

Tests

  • 5 unit tests for trace context propagation on the sending side
  • 3 unit tests for trace context extraction on the receiving side

How it works

Entity operations now follow the same Client→Server span pair pattern as activities and sub-orchestrations:

  1. Orchestration sends entity messageCreateTraceContext() generates a random clientSpanId → stored in SendEntityMessageAction.ParentTraceContext
  2. Backend dispatches entity work → copies ParentTraceContext to both the entity message event and the orchestrator history event
  3. Entity worker executesOnRunEntityBatchAsync creates a Server span using the propagated trace context (clientSpanId becomes the parent)
  4. Orchestration replays with result → looks up EntityOperationCalledEvent from history → StartTraceActivityForSchedulingEntityOperation creates a Client span with SetSpanId() matching the clientSpanId
  5. Result: Client span (orchestration) → Server span (entity) — properly nested in trace viewers

Screenshots

For this fix, we used a Durable Agent scenario to validate. The source code is here. TL/DR, it's an orchestration that calls two entities and then an activity function. The activity function call was needed so that we could compare the spans but also to ensure we didn't regress activity behavior (an earlier version of this change did regress activities a bit):

Before

Here's what the scenario looked like prior to the fix.

image

Notice the entity traces don't appear at all. They can be seen as top-level unparented traces (not shown).

After

Here's what the scenario looked like after the fix.

image

🎉

Merge note

Do not merge until the protobuf change (microsoft/durabletask-protobuf#64) has been properly merged and vendored.

Related PRs

Fixes #653

Entity operations triggered from orchestrations don't propagate
distributed trace context because SendEntityMessageAction lacks a
parentTraceContext field. This causes entity operations to appear
disconnected from their parent orchestration traces.

Changes:
- Vendor updated proto with parentTraceContext field on SendEntityMessageAction
  (see microsoft/durabletask-protobuf#64)
- Set ParentTraceContext in ProtoUtils.ConstructOrchestratorResponse for
  entity message actions (matching existing behavior for tasks and
  sub-orchestrations)
- Add unit tests for entity trace context propagation

Note: A corresponding server-side change is also needed in the Durable Task
Scheduler backend to propagate SendEntityMessageAction.ParentTraceContext
to OperationRequest.traceContext on entity work items.

Fixes #653
Copilot AI review requested due to automatic review settings March 1, 2026 16:23
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a distributed tracing gap where entity operations dispatched from orchestrations lacked trace context propagation. Without ParentTraceContext on SendEntityMessageAction, entity spans appeared as disconnected root traces in distributed trace viewers (e.g., Aspire Dashboard, Jaeger) rather than as children of the orchestration that triggered them.

Changes:

  • Added parentTraceContext field (field 5) to the SendEntityMessageAction proto message, consistent with ScheduleTaskAction and CreateSubOrchestrationAction
  • Set ParentTraceContext = CreateTraceContext() in ProtoUtils.ConstructOrchestratorResponse for entity message actions, matching the existing behavior for activities and sub-orchestrations
  • Added unit tests verifying trace context propagation for signal/call entity, null-activity fallback, non-entity-converted actions, and unique span ID generation

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
src/Grpc/orchestrator_service.proto Adds parentTraceContext field (index 5) to SendEntityMessageAction
src/Shared/Grpc/ProtoUtils.cs Sets ParentTraceContext on entity message actions; incidental character encoding change in unrelated comments
test/Worker/Grpc.Tests/ProtoUtilsTraceContextTests.cs New test class covering trace context propagation for entity operations

cgillum added 2 commits March 1, 2026 22:05
…quest

Add parentTraceContext fields to EntityOperationSignaledEvent and
EntityOperationCalledEvent proto messages, and extract them in
ToEntityBatchRequest to populate OperationRequest.TraceContext.

This is the receiving side of the trace context propagation: once the
DTS backend populates parentTraceContext on entity operation events
(from SendEntityMessageAction.ParentTraceContext), the SDK will
correctly extract and use the trace context for entity operations.
Copilot AI review requested due to automatic review settings March 3, 2026 22:41
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.


You can also share your feedback on Copilot code review. Take the survey.

EntityUnlockSentEvent entityUnlockSent = 4;
}

TraceContext parentTraceContext = 5;
Copy link
Member Author

@cgillum cgillum Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: don't merge changes to this file until properly vendored (requires microsoft/durabletask-protobuf#64 to be merged).

Restore original LF line endings that were inadvertently converted to
CRLF when editing on Windows. Only the actual code changes remain in
the diff.
Copy link
Contributor

@sophiatev sophiatev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, can you just add a screenshot of an example trace for posterity?

Create trace spans for entity operation execution in OnRunEntityBatchAsync,
using the trace context propagated from the parent orchestration. This
completes the distributed tracing story for entities by making entity
operations visible in trace viewers.

- Add EntityOperation constant to TraceActivityConstants
- Add StartTraceActivityForEntityOperation to TraceHelper
- Create span with entity metadata tags in GrpcDurableTaskWorker.Processor
Copilot AI review requested due to automatic review settings March 4, 2026 16:40
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/Worker/Grpc/GrpcDurableTaskWorker.Processor.cs:933

  • The entity trace activity is disposed via the using statement after CompleteEntityTaskAsync() completes, which means the span duration will include the network round-trip time of sending the response. In contrast, OnRunActivityAsync explicitly calls traceActivity?.Stop() before CompleteActivityTaskAsync() to avoid including the network transmission latency in the measured span (see the comment at line 853: "Stop the trace activity here to avoid including the completion time in the latency calculation"). The entity handler should follow the same pattern and explicitly stop the activity before calling CompleteEntityTaskAsync.
            P.EntityBatchResult response = batchResult.ToEntityBatchResult(
                completionToken,
                operationInfos?.Take(batchResult.Results?.Count ?? 0));

            await this.client.CompleteEntityTaskAsync(response, cancellationToken: cancellation);

You can also share your feedback on Copilot code review. Take the survey.

cgillum and others added 2 commits March 4, 2026 14:50
Add Client spans (emitted during orchestration replay) and Server spans
(emitted during entity execution) for entity operations. This creates
proper Client->Server span nesting in trace viewers, matching the
existing pattern used by activities and sub-orchestrations.

- TraceHelper: Add EmitTraceActivityForEntityOperationCompleted/Failed
  and StartTraceActivityForSchedulingEntityOperation (Client spans)
- Processor: Handle EntityOperationCompleted/Failed in newEvents switch
  with GetEntityOperationCalledEvent helper for request correlation
- Include before/after screenshots showing entity traces in Aspire

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Screenshots are provided separately in the PR description.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 4, 2026 22:54
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


You can also share your feedback on Copilot code review. Take the survey.

cgillum and others added 2 commits March 4, 2026 15:10
- Add Schema.Task.EntityOperationName constant for entity operation tag
- Replace magic string 'durabletask.entity.operation' with constant
- Combine nested if statements for ParentTraceContext null check
- Extract entity name from TargetInstanceId (format '@name@key')
- Add comment explaining Server span trace context for entity batches

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Each entity operation in a batch may come from a different orchestration
with its own trace context. Create individual Server spans per operation
instead of a single span for the whole batch, so each operation's span
is correctly parented under its corresponding Client span.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 4, 2026 23:47
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.


You can also share your feedback on Copilot code review. Take the survey.

- Replace foreach loop with LINQ Select/OfType/ToList for trace activities
- Mark trace activities with Error status in entity-not-found path
- Use non-nullable List<Activity> initialized via LINQ expression

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@andystaples
Copy link
Collaborator

@cgillum What's the story with entity lock events wrt. distributed tracing? Do we currently emit spans/traces for locking? Is it worth adding here if not?

@cgillum
Copy link
Member Author

cgillum commented Mar 5, 2026

What's the story with entity lock events wrt. distributed tracing? Do we currently emit spans/traces for locking? Is it worth adding here if not?

@andystaples I considered Lock requests as out-of-scope for this PR since it's not directly applicable to entity execution traces. However, I think it's a good future work item for us to consider since it would be helpful for users to be able to see when locks are added/removed.

string targetInstanceId = historyEvent?.EntityOperationCalled?.TargetInstanceId ?? string.Empty;

// Extract entity name from instance ID (format: "@name@key")
string entityName = targetInstanceId;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we have APIs to do this in EntityInstanceId if I'm not mistaken?

calledEvent.ParentTraceContext?.TraceState,
out ActivityContext parentContext))
{
newActivity.SetSpanId(parentContext.SpanId.ToString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you happen to know why we use Activity.Current.Context as the parent trace context but manually set the span ID here? Why aren't we setting the parent trace context to just be the parentContext extracted here?

I see this is done for the other methods here as well but I forget (or perhaps never understood) why

FailureDetails = new FailureDetails(frameworkException),
};

traceActivities.ForEach(a => a.SetStatus(ActivityStatusCode.Error, frameworkException.Message));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again just a question for my own understanding - in this case and the case above, we will just end up with Activities with just server spans and no corresponding client span?

newEvent.TimerFired);
break;

case P.HistoryEvent.EventTypeOneofCase.EntityOperationCompleted:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the logic for emitting client spans for an entity signal request? Maybe I'm missing something but looks like this is just for entity calls

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just remembered that @philliphoff added an extensive TracingIntegrationTests class under Grpc.Integration.Tests. It would be good to give copilot the task of adding tests for the entities cases as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Entity operations don't propagate trace context from parent orchestration

4 participants